Titanic Data Analysis By Gangadhara Naga Sai

Overview Of Titanic Dataset

In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this project, we explore the RMS Titanic passenger manifest to investigate which passengers survived and which did not. The dataset contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic, and was obtained from Kaggle (https://www.kaggle.com/c/titanic/data).

Data Wrangling


In [2]:
import numpy as np
import pandas as pd
from IPython.display import display

%matplotlib inline

# Load the dataset
files = "titanic_data.csv"
data_titanic = pd.read_csv(files)
display(data_titanic.head())


   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Data Description

From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES)

  • 1st = Upper
  • 2nd = Middle
  • 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way:

  • Sibling = brother, sister, stepbrother, stepsister
  • Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:

  • Parent = mother, father
  • Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.


In [2]:
data = data_titanic

# Show the dataset 
display(data.head())
data.info()


   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

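From the info() output above, we can see that the columns Age, Cabin and Embarked have missing values.

Handling the missing values:

There are three options: ignore the rows with missing data, exclude the variable altogether, or substitute the missing values with the mean or median.

  • Age: about 80% of the values are available (714 of 891). It seems to be an important variable, so it is kept and the missing values are filled in.
  • Embarked: the port of embarkation does not seem relevant to this analysis, so it is excluded.
  • Cabin: only about 23% of the values are available (204 of 891), so it is excluded.
  • PassengerId, Name, Ticket and Fare do not seem to contribute to the survival investigation, so they are excluded as well.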
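
The availability percentages quoted above can be read off info(), but they can also be computed directly. The following is a minimal sketch on a made-up miniature frame (the name toy and its values are assumptions, not the real titanic_data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_data.csv (values are made up)
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
})

# notnull() marks available values; the column mean of that boolean
# mask is the share of available values, here expressed in percent
available_pct = toy.notnull().mean() * 100
print(available_pct.round(1))
```

Applied to the real data, the same expression should give roughly 80% for Age and 23% for Cabin, matching the non-null counts reported by info().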


In [3]:
# Exclude the columns that are not used in the analysis
for a in ['Ticket','Cabin','Embarked','Name','PassengerId','Fare']:
    if a in data.columns:
        del data[a]

In [4]:
print("Age median values by Sex and Pclass:")
# Group by gender and class and take the median age, so that missing ages
# can be replaced with the median of the matching group instead of NaN
print(data.groupby(['Sex','Pclass'], as_index=False)['Age'].median())
print("Age values for the first 5 passengers with a missing Age:")
print(data.loc[data['Age'].isnull(),['Age','Sex','Pclass']].head(5))
# Apply the transformation: missing Age values are filled with the median of their Sex/Pclass group
data['Age'] = data.groupby(['Sex','Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
print(data.loc[[5,17,19,26,28],['Age','Sex','Pclass']].head(5))
# Fallback: fill any Age that is still missing with the overall mean
data['Age'] = data['Age'].fillna(data['Age'].mean())


Age median values by Sex and Pclass:
      Sex  Pclass   Age
0  female       1  35.0
1  female       2  28.0
2  female       3  21.5
3    male       1  40.0
4    male       2  30.0
5    male       3  25.0
Age values for the first 5 passengers with a missing Age:
    Age     Sex  Pclass
5   NaN    male       3
17  NaN    male       2
19  NaN  female       3
26  NaN    male       3
28  NaN  female       3
     Age     Sex  Pclass
5   25.0    male       3
17  30.0    male       2
19  21.5  female       3
26  25.0    male       3
28  21.5  female       3

Data Exploration and Visualization


In [6]:
data_s=data
survival_group = data_s.groupby('Survived')
survival_group.describe()


Out[6]:
                       Age       Parch      Pclass       SibSp
Survived
0        count  549.000000  549.000000  549.000000  549.000000
         mean    29.737705    0.329690    2.531876    0.553734
         std     12.818264    0.823166    0.735805    1.288399
         min      1.000000    0.000000    1.000000    0.000000
         25%     22.000000    0.000000    2.000000    0.000000
         50%     25.000000    0.000000    3.000000    0.000000
         75%     37.000000    0.000000    3.000000    1.000000
         max     74.000000    6.000000    3.000000    8.000000
1        count  342.000000  342.000000  342.000000  342.000000
         mean    28.108684    0.464912    1.950292    0.473684
         std     14.010565    0.771712    0.863321    0.708688
         min      0.420000    0.000000    1.000000    0.000000
         25%     21.000000    0.000000    1.000000    0.000000
         50%     27.000000    0.000000    2.000000    0.000000
         75%     36.000000    1.000000    3.000000    1.000000
         max     80.000000    5.000000    3.000000    4.000000

From the above statistics:

  • Youngest to survive: 0.42
  • Youngest to die: 1.0
  • Oldest to survive: 80.0
  • Oldest to die: 74.0
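
The four extremes listed above can also be pulled out without scanning the full describe() table. A minimal sketch, assuming a small made-up frame named toy whose extreme ages are copied from the statistics above and whose remaining value is invented:

```python
import pandas as pd

# Hypothetical toy frame: the extreme ages mirror the describe() output,
# the remaining row (27.0) is made up for illustration
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age":      [1.0, 74.0, 0.42, 80.0, 27.0],
})

# One row per outcome (0 = died, 1 = survived) with youngest and oldest age
age_range = toy.groupby("Survived")["Age"].agg(["min", "max"])
print(age_range)
```

On the notebook's own frame the equivalent call would be data.groupby('Survived')['Age'].agg(['min', 'max']).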

In [7]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for all graphs
#sns.set_style("light")
#sns.set_style("whitegrid")
sns.set_style("ticks", {"xtick.major.size": 8, "ytick.major.size": 8})

The plot above shows that women were given first preference, and that survival also depended on class.

Socio-economic standing was a factor in the survival rate of passengers of each gender:

  • Class 1 - female survival rate: 96.81%
  • Class 1 - male survival rate: 36.89%
  • Class 2 - female survival rate: 92.11%
  • Class 2 - male survival rate: 15.74%
  • Class 3 - female survival rate: 50.0%
  • Class 3 - male survival rate: 13.54%
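
The code that produced these percentages is not shown in the notebook. One plausible way to obtain them is the mean of Survived per class and gender, sketched here on a made-up frame (toy and its values are assumptions, so the numbers differ from the real ones above):

```python
import pandas as pd

# Hypothetical toy data: two women and two men in each of classes 1 and 3
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is 0/1, so its mean per group is the survival rate;
# unstack() turns Sex into columns for a class-by-gender table
rates = (toy.groupby(["Pclass", "Sex"])["Survived"]
            .mean()
            .mul(100)
            .unstack("Sex"))
print(rates)
```

Because Survived is coded as 0 and 1, no explicit counting is needed: the group mean is already the fraction of survivors.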

Did women and children have first preference for the lifeboats?


In [13]:
def group(d, v):
    # d is the passenger's Sex, v is the passenger's Age;
    # anyone younger than 18 is classified as a child
    if (d == 'female') and (v >= 18):
        return 'Woman'
    elif v < 18:
        return 'child'
    elif (d == 'male') and (v >= 18):
        return 'Man'

data['Category'] = data.apply(lambda row: group(row['Sex'], row['Age']), axis=1)
data.head(5)


Out[13]:
   Survived  Pclass     Sex   Age  SibSp  Parch group_age Category
0         0       3    male  22.0      1      0    Adults      Man
1         1       1  female  38.0      1      0    Adults    Woman
2         1       3  female  26.0      0      0    Adults    Woman
3         1       1  female  35.0      1      0    Adults    Woman
4         0       3    male  35.0      0      0    Adults      Man

In [ ]:
# Divide the Age data into 3 buckets, (0, 18], (18, 40] and (40, 90],
# and label them 'Childs', 'Adults' and 'Seniors' respectively
data['group_age'] = pd.cut(data['Age'], bins=[0,18,40,90], labels=['Childs','Adults','Seniors'])

# Mean survival rate grouped by 'group_age', 'Sex' and 'Pclass'
df = data.groupby(['group_age','Sex','Pclass'], as_index=False)['Survived'].mean()
df.to_csv("titanic_group_age.csv", sep=',', encoding='utf-8')
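
It may help to see how pd.cut treats the bin edges, since by default each interval includes its right edge. A minimal sketch with made-up ages (the Series name ages is an assumption for illustration):

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 17.0, 18.0, 18.5, 40.0, 41.0, 80.0])

# Same bins and labels as in the cell above; intervals are (0, 18], (18, 40], (40, 90]
buckets = pd.cut(ages, bins=[0, 18, 40, 90],
                 labels=["Childs", "Adults", "Seniors"])
print(pd.DataFrame({"Age": ages, "group_age": buckets}))
```

Note that an 18-year-old lands in 'Childs' here, whereas the group() function above treats age 18 as an adult, so the two groupings differ slightly at that boundary.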

In [17]:
# Mean of the numeric columns for each Category/Parch combination, sorted by survival rate
data_C = data.groupby(['Category','Parch'])[['Survived','Pclass','Age','SibSp']].mean()
data_C = data_C.sort_values("Survived")
data_C.to_csv("Category.csv", sep=',', encoding='utf-8')

In [31]:
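# Bucket Age into 10-year intervals, (0, 10] up to (70, 80], and average the numeric columns per bucket
data['Age_group'] = pd.cut(data['Age'], bins=list(range(0,90,10)))
data_age = data.groupby("Age_group")[['Survived','Pclass','Age','SibSp','Parch']].mean()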

data_age.to_csv("Age_group.csv", sep=',', encoding='utf-8')

In [10]:
%run P2



From the values above we can see that the survival rate increases from top to bottom. From the plot we can see the distribution of the survival rate among men, women and children, based on class.

Conclusion

We observe the following order of survival rate, from highest to lowest, based on Age, Sex and Class:

  • children and women of the upper class
  • children and women of the middle class
  • women of the lower class
  • children of the lower class
  • men of the upper class
  • men of the middle and lower classes, who have the lowest survival rate

The analysis suggests that women with upper socio-economic standing (first class) and children had the best chance of survival. Age did not seem to be a major factor. Men in third class had the lowest chance of survival. Women and children of all classes mostly had a higher survival rate than men.
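
A ranking like the one in this conclusion can be derived by averaging Survived per Category and Pclass and sorting the result. This is a minimal sketch on made-up rows (toy and its values are assumptions), so the order it prints is illustrative only:

```python
import pandas as pd

# Hypothetical toy data with the Category labels used earlier in this notebook
toy = pd.DataFrame({
    "Category": ["Woman", "Woman", "child", "child",
                 "Man", "Man", "Man", "Man", "Man", "Man"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Survival rate per (Category, Pclass), highest first
ranking = (toy.groupby(["Category", "Pclass"])["Survived"]
              .mean()
              .sort_values(ascending=False))
print(ranking)
```

On the notebook's frame the same chain would start from data instead of toy.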

Limitations:

  • Some men and women were missing Age data. These values were replaced with the median age of the passenger's gender and class group instead of NaN, which could have skewed the calculations.
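
One way to gauge this limitation is to remember which ages were imputed, so that results can be re-checked on the observed values only. A minimal sketch, assuming a made-up frame toy and a hypothetical helper column Age_imputed that does not exist in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 35.0, np.nan],
})

# Record which rows are missing BEFORE filling (Age_imputed is an assumed name)
toy["Age_imputed"] = toy["Age"].isnull()

# Same group-median fill as used in the data wrangling step
toy["Age"] = (toy.groupby(["Sex", "Pclass"])["Age"]
                 .transform(lambda s: s.fillna(s.median())))

# Compare the spread of observed ages with the spread after imputation
observed_only = toy.loc[~toy["Age_imputed"], "Age"]
print(toy)
print(observed_only.std(), toy["Age"].std())
```

Median filling places every imputed passenger at the center of their group, so it tends to reduce the spread of Age; comparing both columns gives a rough sense of how much.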